Abstract. We prove a strong inapproximability result for the Balanced Minimum Evolution
Problem. Our proof also implies that the problem remains NP-hard even when restricted
to metric instances. Furthermore, we give an MST-based 2-approximation algorithm for the
problem on such instances.

1. Introduction 

Let $[n] := \{1, \dots, n\}$ be a set of $n$ species. Let $(\delta_{ij})$ be an $n \times n$ symmetric matrix with
nonnegative entries and zeroes on the diagonal, where $\delta_{ij}$ represents the dissimilarity between
species $i$ and $j$. The Balanced Minimum Evolution Problem is to find a cubic tree $T$ (every
internal vertex has degree 3) with $n$ leaves, together with a bijection between the leaves of $T$
and the species, such that

(1)    $f(T) := \sum_{i<j} \delta_{ij}\, 2^{2-d_{ij}}$

is minimized, where $d_{ij}$ denotes the distance between the leaves for species $i$ and $j$ in $T$. We
point out that our objective function is twice the length of the tree $T$, which is the commonly
used objective function.
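To make the objective (1) concrete, here is a minimal Python sketch that evaluates $f(T)$ exactly. It assumes (this is not part of the paper) that the tree is given as an adjacency dictionary whose leaves are the species, numbered from 0 rather than 1, and that `delta` is a list of lists; the names `leaf_distances` and `bme_cost` are illustrative.

```python
from collections import deque
from fractions import Fraction

def leaf_distances(adj, leaves):
    """Return d[i][j], the number of edges between leaves i and j (BFS per leaf)."""
    dist = {}
    for s in leaves:
        seen = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[s] = {t: seen[t] for t in leaves}
    return dist

def bme_cost(adj, delta):
    """Evaluate f(T) = sum_{i<j} delta[i][j] * 2^(2 - d_ij) with exact arithmetic."""
    n = len(delta)
    d = leaf_distances(adj, range(n))
    return sum(Fraction(delta[i][j]) * Fraction(4, 2 ** d[i][j])
               for i in range(n) for j in range(i + 1, n))

# Example: the cubic tree on 4 leaves. Leaves 0, 1 hang off internal
# vertex "a"; leaves 2, 3 hang off internal vertex "b".
quartet = {0: ["a"], 1: ["a"], 2: ["b"], 3: ["b"],
           "a": [0, 1, "b"], "b": [2, 3, "a"]}
ones = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(bme_cost(quartet, ones))
```

Exact fractions are used so that identities stated later in the text can be checked without rounding error.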

This computational biology problem was introduced by Desper and Gascuel, inspired by
work of Pauplin, and has since been studied by several authors. Although no hardness proof
for the problem has been published, it appears to have been known to be NP-hard since 2004
(Guillemot). To our knowledge, plain NP-hardness is the strongest hardness result known
about the problem. In particular, the complexity of the Balanced Minimum Evolution Problem
is still open when the dissimilarities are restricted to be 0/1, or to satisfy the triangle inequality.
Furthermore, the problem is not known to be hard to approximate.

In Section 2 we start with preliminaries. Then, in Section 3, we prove that the Balanced
Minimum Evolution Problem does not admit any interesting approximation algorithm (unless
P = NP): the problem is NP-hard to approximate to within a $c^n$-factor for some constant $c > 1$.
Finally, in Section 4 we give a simple 2-approximation algorithm for the problem, in case the
dissimilarities $\delta_{ij}$ satisfy the triangle inequality. By the results of Section 3, the problem
is NP-hard in this case.

2. Preliminary Remarks and Observations 
2.1. Kraft's Inequality. Kraft's inequality for a binary tree with n leaves states that 

(2)    $\sum_{i \in [n]} 2^{-d_i} \le 1,$

where $d_i$ is the distance from the root $r$ to the $i$th leaf. It is easy to prove that, if the tree
is a binary cubic tree (meaning that all internal vertices have degree 3 except the root), then
equality holds in (2). This implies that for every feasible solution $T$ to the Balanced Minimum
Evolution Problem and for every fixed leaf $j$,

(3)    $\sum_{i \in [n] \setminus \{j\}} 2^{2-d_{ij}} = 2.$

Indeed, deleting the leaf $j$ and rooting $T$ at the neighbor of $j$ yields a binary cubic tree in
which leaf $i$ is at distance $d_{ij} - 1$ from the root.
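Identity (3) is easy to check numerically. The sketch below is only an illustration: it grows a random cubic tree by repeatedly subdividing an edge and hanging a new leaf on the new vertex (a generation scheme chosen here for convenience, not taken from the paper), then evaluates the left-hand side of (3) for a given leaf. Leaves are numbered from 0, and the function names are hypothetical.

```python
import random
from fractions import Fraction

def random_cubic_tree(n, rng):
    """Grow a cubic tree with leaves 0..n-1 (n >= 3) by repeatedly
    subdividing a random edge and hanging a new leaf on the new vertex."""
    center = ("int", 0)
    edges = [(0, center), (1, center), (2, center)]
    for leaf in range(3, n):
        u, v = edges.pop(rng.randrange(len(edges)))
        w = ("int", leaf)
        edges += [(u, w), (w, v), (leaf, w)]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def distances_from(adj, source):
    """Edge distances from `source` to every vertex of the tree."""
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                stack.append(v)
    return dist

def kraft_sum(adj, n, j):
    """Left-hand side of (3): sum over leaves i != j of 2^(2 - d_ij)."""
    dist = distances_from(adj, j)
    return sum(Fraction(4, 2 ** dist[i]) for i in range(n) if i != j)
```

For any tree produced by `random_cubic_tree`, `kraft_sum(adj, n, j)` should equal 2 for every leaf `j`, in agreement with (3).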

2.2. The Objective as an Average Over Compatible Tours. A result of Semple and
Steel states that $2^{2-d_{ij}}$ is the probability that leaves $i$ and $j$ are consecutive in an (undirected)
tour on the leaves of $T$ chosen uniformly at random from the tours compatible with $T$, that is,
such that the tree $T$ can be embedded in the plane so that the tour visits the leaves of $T$ in
clockwise order. Thus, by linearity of expectation, we have the following lemma.

Lemma 2.1. For all feasible solutions $T$ to the Balanced Minimum Evolution Problem, $f(T)$
is the expected cost of a random tour compatible with $T$.
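Lemma 2.1 can be checked by brute force on small trees. The following Python sketch assumes a cubic tree given as an adjacency dictionary with leaves numbered from 0; it enumerates the planar embeddings by choosing, at every internal vertex, one of the two orders of its two subtrees. Each undirected compatible tour arises from exactly two such choices (an embedding and its mirror image), so a plain average over the enumeration is a uniform average over tours. All names are illustrative.

```python
from fractions import Fraction

def leaf_orders(adj, v, parent):
    """All left-to-right leaf orders of the subtree hanging below v."""
    children = [c for c in adj[v] if c != parent]
    if not children:                      # v is a leaf
        return [[v]]
    left, right = children                # cubic tree: exactly two children
    orders = []
    for a in leaf_orders(adj, left, v):
        for b in leaf_orders(adj, right, v):
            orders.append(a + b)          # one embedding at v
            orders.append(b + a)          # the mirrored embedding at v
    return orders

def expected_tour_cost(adj, delta, root_leaf=0):
    """Average cost of the tours compatible with the cubic tree `adj`."""
    (neighbor,) = adj[root_leaf]
    tours = [[root_leaf] + o for o in leaf_orders(adj, neighbor, root_leaf)]
    total = sum(Fraction(delta[t[i - 1]][t[i]]) for t in tours for i in range(len(t)))
    return total / len(tours)

def distances_from(adj, source):
    """Edge distances from `source` to every vertex of the tree."""
    dist = {source: 0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                stack.append(v)
    return dist

def bme_cost(adj, delta):
    """f(T) = sum_{i<j} delta[i][j] * 2^(2 - d_ij)."""
    n = len(delta)
    return sum(Fraction(delta[i][j]) * Fraction(4, 2 ** distances_from(adj, i)[j])
               for i in range(n) for j in range(i + 1, n))

# A 5-leaf cubic tree (a caterpillar) and an arbitrary symmetric matrix.
caterpillar = {0: ["a"], 1: ["a"], 2: ["b"], 3: ["c"], 4: ["c"],
               "a": [0, 1, "b"], "b": ["a", 2, "c"], "c": ["b", 3, 4]}
delta = [[0, 3, 1, 4, 1],
         [3, 0, 5, 9, 2],
         [1, 5, 0, 6, 5],
         [4, 9, 6, 0, 3],
         [1, 2, 5, 3, 0]]
```

On this example both `expected_tour_cost(caterpillar, delta)` and `bme_cost(caterpillar, delta)` evaluate to 37/2.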

In the light of Lemma 2.1, it should not surprise the reader that one can define the Balanced
Minimum Evolution Problem over all trees $T$ with $n$ leaves, by defining $f(T)$ as the expected
cost of a tour picked uniformly at random from the tours compatible with $T$. Then, letting
$P_{ij} = P_{ij}(T)$ denote the unique $i$-$j$ path in $T$ (with vertex set $V(P_{ij})$) and

(4)    $\pi_{ij} := 2 \prod_{u \in V(P_{ij}) \setminus \{i,j\}} \frac{1}{\deg_T(u) - 1},$

one has

$f(T) = \sum_{i<j} \delta_{ij}\, \pi_{ij}.$

However, it is known that the $\pi$-matrices of non-cubic trees are convex combinations of the
$\pi$-matrices of cubic trees, hence for every non-cubic tree $T$ there always exists a cubic tree
$T'$ with $f(T') \le f(T)$. In Section 4 we will give a polynomial time algorithm to find such a
cubic tree $T'$.
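As an illustration of (4), the sketch below computes $\pi_{ij}$ and the resulting cost for an arbitrary tree whose leaves are the species (numbered from 0 here). The adjacency-dictionary representation and the function names are assumptions made for this example only.

```python
from fractions import Fraction

def path(adj, s, t):
    """Vertices of the unique s-t path in the tree `adj`, endpoints included."""
    parent = {s: None}
    stack = [s]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    walk = [t]
    while walk[-1] != s:
        walk.append(parent[walk[-1]])
    return walk[::-1]

def pi(adj, i, j):
    """pi_ij = 2 * product of 1 / (deg(u) - 1) over the internal vertices u
    of the i-j path, as in (4)."""
    value = Fraction(2)
    for u in path(adj, i, j)[1:-1]:
        value /= len(adj[u]) - 1
    return value

def general_cost(adj, delta):
    """f(T) = sum_{i<j} delta[i][j] * pi_ij for an arbitrary tree T."""
    n = len(delta)
    return sum(delta[i][j] * pi(adj, i, j)
               for i in range(n) for j in range(i + 1, n))

# A star with 5 leaves: every tour is compatible, and pi_ij = 2 / (5 - 1).
star = {i: ["c"] for i in range(5)}
star["c"] = list(range(5))
ones = [[0 if i == j else 1 for j in range(5)] for i in range(5)]
```

For a cubic tree every internal vertex contributes a factor 1/2, so `pi` reduces to $2^{2-d_{ij}}$ and `general_cost` agrees with (1).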

2.3. The "All 1" Case. Every solution is optimal in that case: 

Lemma 2.2. Suppose $\delta_{ij} = 1$ for every $i, j$ with $i \ne j$. Then, for all feasible solutions $T$,

$f(T) = n.$

In particular, the optimum of the Balanced Minimum Evolution Problem is $n$.

Proof 1. By (3),

$f(T) = \sum_{i<j} 2^{2-d_{ij}} = \frac{1}{2} \sum_{j \in [n]} \sum_{i \in [n] \setminus \{j\}} 2^{2-d_{ij}} = \frac{1}{2} \cdot 2n = n.$ □

Proof 2. Every tour on the leaves of $T$ has $n$ edges, of cost 1 each. By Lemma 2.1, it follows
that $f(T) = n$ for all feasible solutions $T$. □

3. NP-Hardness and Inapproximability 

Theorem 3.1. There exists a constant $c > 1$ such that the Balanced Minimum Evolution
Problem has no $c^n$-approximation algorithm unless P = NP, where $n$ denotes the number of
species. This remains true even when all entries of the dissimilarity matrix are in $\{0, 1\}$.

Proof. The reduction is from the 3-Colorability Problem: We are given a (simple, undirected) 
graph G on p vertices, and have to decide whether V(G) can be partitioned into three stable 
sets (recall that a stable set is a set of mutually non-adjacent vertices). 

We may assume without loss of generality that $G$ contains two vertex-disjoint triangles.
Indeed, if not, it suffices to add two new vertex-disjoint triangles to $G$ (six new vertices in
total); this clearly has no influence on whether $G$ is 3-colorable or not.

Let $\lambda$ be an arbitrary constant with $1/2 < \lambda < 2/3$. We will prove the claim with

$c := 2^{(2/3-\lambda)(3-4\lambda)} > 1.$

(By taking $\lambda$ sufficiently close to 1/2, one has $c \ge 1.12$.)
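To get a feel for how the constant $c$ depends on $\lambda$, one can simply evaluate the formula. The helper below is a small illustrative sketch (the function name is hypothetical); it uses floating-point arithmetic, so the values are approximations.

```python
def inapprox_base(lam):
    """Return c = 2^((2/3 - lam) * (3 - 4 * lam)) for 1/2 < lam < 2/3."""
    if not 0.5 < lam < 2 / 3:
        raise ValueError("lambda must lie strictly between 1/2 and 2/3")
    return 2 ** ((2 / 3 - lam) * (3 - 4 * lam))

for lam in (0.501, 0.55, 0.6, 0.66):
    print(lam, inapprox_base(lam))
```

The exponent decreases as $\lambda$ grows, so $c$ approaches its supremum $2^{1/6} \approx 1.1225$ only as $\lambda$ tends to 1/2, and tends to 1 as $\lambda$ tends to 2/3.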
Let $m$ be the number of edges in $G$. We may assume

(5)    $m \le 2^{(2/3-\lambda)p} = 2^{(2/3-\lambda)|V(G)|}$

because otherwise, as $m < p^2$, the graph $G$ has bounded size and we can check whether $G$ is
3-colorable using brute force.

Define $k$ as the smallest integer satisfying $k \ge p/(2\lambda - 1)$ and $k \equiv 1 \pmod 3$. Consider
an arbitrary ordering $v_1, v_2, \dots, v_p$ of the vertices of $G$. We define an instance of the Balanced
Minimum Evolution Problem with $n := p + k$ species as follows. The first $p$ species are associated
with the vertices of $G$: species $i$ (for $i \in [p]$) corresponds to vertex $v_i$. The matrix $(\delta_{ij})$ is defined
by setting, for $i \ne j$,

$\delta_{ij} := \begin{cases} 1 & \text{if } i, j \in [p] \text{ and } v_i v_j \in E(G), \\ 0 & \text{otherwise.} \end{cases}$
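The instance of the reduction is straightforward to build. The sketch below assumes that the vertices of $G$ are numbered from 0 to $p-1$ (0-based, unlike the text), that the edges are given as pairs, and that $\lambda$ is passed as a `Fraction` so that the ceiling is computed exactly; the function name is illustrative.

```python
import math
from fractions import Fraction

def reduction_instance(p, edges, lam):
    """Build the 0/1 dissimilarity matrix of the reduction.

    p     -- number of vertices of G (vertices are 0..p-1 here)
    edges -- iterable of pairs (i, j) of adjacent vertices
    lam   -- the constant lambda, a Fraction with 1/2 < lam < 2/3
    Returns (k, delta) where delta has n = p + k rows and columns.
    """
    # Smallest integer k with k >= p / (2*lam - 1) and k = 1 (mod 3).
    k = math.ceil(Fraction(p) / (2 * lam - 1))
    while k % 3 != 1:
        k += 1
    n = p + k
    delta = [[0] * n for _ in range(n)]
    for i, j in edges:
        delta[i][j] = delta[j][i] = 1     # only edges of G get dissimilarity 1
    return k, delta

# Example: G consists of two vertex-disjoint triangles, lambda = 3/5.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
k, delta = reduction_instance(6, triangles, Fraction(3, 5))
print(k, len(delta))
```

With $p = 6$ and $\lambda = 3/5$ one gets $p/(2\lambda - 1) = 30$, hence $k = 31$ and $n = 37$.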

Consider an optimal solution for the instance of the Balanced Minimum Evolution Problem
described above. This solution is a cubic tree $T$ with $n$ leaves together with a bijection from
the set of species to the set of leaves of $T$. For simplicity, we denote by $v_i$ ($i \in [n]$) the leaf
of $T$ associated to species $i$. (Thus, when $i \le p$, $v_i$ denotes both the $i$th vertex of $G$ and the
corresponding leaf of $T$; which one is meant will be clear from the context.) The cost of this
optimal solution is denoted OPT. Thus, we have

(6)    $\mathrm{OPT} = \sum_{v_i v_j \in E(G)} 2^{2-d_{ij}},$

where $d_{ij}$ is the distance between species $i$ and $j$ in $T$.

First we show:

(7)    If $d_{ij} > \lambda k$ for all $v_i v_j \in E(G)$, then $G$ is 3-colorable.

Consider an arbitrary triangle in $G$; without loss of generality we may assume that the vertices
of this triangle are $v_1, v_2, v_3$. Let $C$ be the union of the $v_1$-$v_2$ path, the $v_1$-$v_3$ path, and the
$v_2$-$v_3$ path in $T$. (Recall that there is a unique path between two given vertices in a tree, thus $C$
is well defined.) Then $C$ is isomorphic to a subdivision of the claw and its three leaves are
$v_1$, $v_2$, and $v_3$. Let $w$ be the unique vertex in $C$ with degree 3. Let $P_\ell$ ($\ell \in \{1, 2, 3\}$) denote the
path obtained from the $v_\ell$-$w$ path in $T$ by removing $w$. Since $v_1, v_2, v_3$ are pairwise adjacent in
$G$, we have

(8)    $|P_\ell| + |P_{\ell'}| = d_{\ell\ell'} > \lambda k$

for all $\ell, \ell' \in \{1, 2, 3\}$ with $\ell \ne \ell'$. ($|P_\ell|$ stands for the number of vertices in $P_\ell$.)

Let $T_\ell$ ($\ell \in \{1, 2, 3\}$) be the component of $T - w$ containing $v_\ell$, and let $X_\ell$ be the set of
internal vertices in $T$ that are included in $T_\ell$. Observe that $T$ has $n - 2$ internal vertices and
that all vertices of $P_\ell$ ($\ell \in \{1, 2, 3\}$) are internal vertices of $T$, except for $v_\ell$. Using (8) and
$k \ge p/(2\lambda - 1)$ we obtain

(9)    $|X_\ell| \le (n-2) + 1 - \sum_{\ell' \in \{1,2,3\},\, \ell' \ne \ell} |P_{\ell'}| < n - \lambda k - 1 = (1-\lambda)k + p - 1 \le \lambda k - 1$

for all $\ell \in \{1, 2, 3\}$.

Let $S_\ell$ ($\ell \in \{1, 2, 3\}$) be the set of vertices $v_i$ of $G$ such that $v_i$ is a leaf of $T$ that is included in
$T_\ell$. (Thus $S_1 \cup S_2 \cup S_3 = V(G)$.) Every two vertices in $S_\ell$ are at distance at most $|X_\ell| + 1 < \lambda k$
in $T$ by (9). Therefore, $S_1, S_2, S_3$ are stable sets of $G$, and $G$ is 3-colorable. This proves (7).

Next we prove: 

(a) if $G$ is not 3-colorable then $\mathrm{OPT} \ge 2^{2-\lambda k}$;

(b) if $G$ is 3-colorable then $\mathrm{OPT} \le m \cdot 2^{2-(2k+4)/3}$.

The first part of the above claim is a direct consequence of (7): If $G$ is not 3-colorable then
there is an edge $v_i v_j$ of $G$ such that $d_{ij} \le \lambda k$, and hence $\mathrm{OPT} \ge 2^{2-d_{ij}} \ge 2^{2-\lambda k}$.

For the second part, let $S_1, S_2, S_3$ denote the three color classes of a 3-coloring of $G$. Recall
that $G$ has two vertex-disjoint triangles, which implies $|S_\ell| \ge 2$ for every $\ell \in \{1, 2, 3\}$. We build
a feasible solution $T'$ from this coloring which will imply the desired upper bound on OPT.

The tree $T'$ is defined as follows. First, for each $\ell \in \{1, 2, 3\}$, create a path $P_\ell$ on $(k-1)/3 +
|S_\ell| - 1$ vertices (here we use that $k \equiv 1 \pmod 3$). Let $a_\ell$ and $b_\ell$ be the two endpoints of $P_\ell$.
Create a new vertex $w$ and make it adjacent to $b_1$, $b_2$, and $b_3$. Next, attach a leaf to each vertex
of degree 2 in the resulting tree, and attach two new leaves to each of $a_1$, $a_2$, and $a_3$. This
defines the tree $T'$. The species are placed in the following way on the leaves of $T'$: for each
$\ell \in \{1, 2, 3\}$, put the species corresponding to vertices in $S_\ell$ on the $|S_\ell|$ leaves that are closest
to $a_\ell$ in $T'$, in an arbitrary way. The remaining $k$ species are placed arbitrarily on the $k$ leaves
of $T'$ that remain free.

Since $|S_\ell| \ge 2$ for every $\ell \in \{1, 2, 3\}$, the two leaves adjacent to $a_\ell$ are associated with species
in $[p]$, and thus the first $(k-1)/3$ vertices of the path from $b_\ell$ to $a_\ell$ in $T'$ are adjacent to leaves
associated with species not in $[p]$. Hence, if $i, j \in [p]$ are species such that $v_i v_j \in E(G)$, then
they are at distance at least $(k-1)/3 + 1 + (k-1)/3 + 1 = (2k+4)/3$ in $T'$. Therefore, the cost
of this feasible solution is at most $m \cdot 2^{2-(2k+4)/3}$, implying $\mathrm{OPT} \le m \cdot 2^{2-(2k+4)/3}$ as claimed.

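Now, since

$$\begin{aligned}
\frac{2^{2-\lambda k}}{m \cdot 2^{2-(2k+4)/3}} &= \frac{2^{(2/3-\lambda)k + 4/3}}{m} \\
&> \frac{2^{(2/3-\lambda)k}}{m} \\
&\ge 2^{(2/3-\lambda)(k-p)} && \text{(by (5))} \\
&= 2^{(2/3-\lambda)(n-2p)} \\
&\ge 2^{(2/3-\lambda)(3-4\lambda)n} && \text{(since } p \le (2\lambda-1)k < (2\lambda-1)n\text{)} \\
&= c^n,
\end{aligned}$$

it follows that a $c^n$-approximation algorithm for the Balanced Minimum Evolution Problem
could be used to decide whether $G$ is 3-colorable or not. This concludes the proof. □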

An instance of the Balanced Minimum Evolution Problem is said to be metric if the dissimilarity
matrix $(\delta_{ij})$ is a semimetric, that is, if the $\delta_{ij}$'s satisfy

$\delta_{ik} \le \delta_{ij} + \delta_{jk}$

for all distinct species $i, j, k$.

Corollary 3.2. The Balanced Minimum Evolution Problem is NP-hard on metric instances. 
This remains true even if the non-diagonal entries of the dissimilarity matrix are all in {1,2}. 

Proof. By Theorem 3.1, the Balanced Minimum Evolution Problem is NP-hard when all
dissimilarities are in $\{0, 1\}$. Consider such an instance and add 1 to every non-diagonal entry of
the dissimilarity matrix $(\delta_{ij})$, giving a dissimilarity matrix $(\delta'_{ij})$. Then $(\delta'_{ij})$ is a semimetric,
because all its non-diagonal entries are in $\{1, 2\}$ and hence

$\delta'_{ik} \le 2 = 1 + 1 \le \delta'_{ij} + \delta'_{jk}$

for all distinct species $i, j, k$.

Consider a feasible solution $T$ to the instance, and let $f(T, (\delta_{ij}))$ and $f(T, (\delta'_{ij}))$ denote
the cost of the solution w.r.t. $(\delta_{ij})$ and $(\delta'_{ij})$, respectively. Let $(u_{ij}) := (\delta'_{ij}) - (\delta_{ij})$. Then
$f(T, (\delta'_{ij})) = f(T, (\delta_{ij})) + f(T, (u_{ij})) = f(T, (\delta_{ij})) + n$ by Lemma 2.2. It follows that a solution
to the modified instance is optimal if and only if it is optimal for the original instance. □
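The transformation used in this proof is a one-liner, and the semimetric property can be checked by brute force on small examples. The sketch below is illustrative only; both function names are hypothetical.

```python
from itertools import permutations

def shift_off_diagonal(delta):
    """Add 1 to every non-diagonal entry of the dissimilarity matrix."""
    n = len(delta)
    return [[delta[i][j] + 1 if i != j else 0 for j in range(n)]
            for i in range(n)]

def is_semimetric(delta):
    """Check delta[i][k] <= delta[i][j] + delta[j][k] for all distinct i, j, k."""
    n = len(delta)
    return all(delta[i][k] <= delta[i][j] + delta[j][k]
               for i, j, k in permutations(range(n), 3))

# A 0/1 matrix violating the triangle inequality: delta[0][1] = 1 > 0 + 0.
delta = [[0, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
print(is_semimetric(delta), is_semimetric(shift_off_diagonal(delta)))
```

The original matrix fails the check while the shifted one, whose non-diagonal entries are in $\{1, 2\}$, passes it.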



4. A 2-Approximation Algorithm for Metric Instances

In this section we assume that the dissimilarity matrix $(\delta_{ij})$ is a semimetric. We describe an
MST-based 2-approximation algorithm for this special case.

4.1. Two Lower Bounds. Let TSP denote the cost of an optimal tour on the $n$ species
with respect to the costs $\delta_{ij}$, and let again OPT denote the cost of an optimal solution to
the Balanced Minimum Evolution Problem. By Lemma 2.1, because the average of a random
variable is always at least the minimum value achieved by the random variable, we conclude

$\mathrm{OPT} \ge \mathrm{TSP}.$

Now let MST denote the cost of a minimum spanning tree on the species w.r.t. the costs $\delta_{ij}$.
It is known that MST is a lower bound on TSP (deleting one edge from a tour leaves a
spanning tree), thus also

(10)    $\mathrm{OPT} \ge \mathrm{MST}.$



Algorithm 1 A 2-approximation algorithm for metric instances.

1: Compute a minimum spanning tree $T_0$ on the $n$ species w.r.t. costs $\delta_{ij}$.
2: $T \leftarrow T_0$
3: while there is a species $i \in V(T)$ that is not a leaf do
4:     Relabel internal vertex $i$ as $i'$.
5:     Add a new leaf to $T$ adjacent to $i'$ through a new edge of zero cost, and label the leaf $i$.
6: end while
7: Find a feasible cubic tree $T'$ with $f(T') \le f(T)$.
8: return $T'$



4.2. The Algorithm and its Analysis. Consider Algorithm 1 above.

First, it is clear that the cost of $T_0$, as a solution of the minimum spanning tree problem, is
MST.

Second, observe that the modifications performed on $T$ in steps 3-6 induce an extended
semimetric $(\delta_{ij})$ defined over the whole vertex set of the final tree $T$. In this semimetric, for
every leaf $i$ that was moved to the exterior of the tree, we have $\delta_{ii'} = 0$.

Third, observe that the final tree $T$ is an optimal solution of the minimum spanning tree
problem with respect to the extended semimetric $(\delta_{ij})$, of cost MST. Hence, every closed walk
that visits each edge of $T$ twice has cost $2\,\mathrm{MST}$. Since any tour on the leaves of $T$ that is
compatible with $T$ can be obtained by shortcutting such a closed walk, every such tour has cost
at most $2\,\mathrm{MST}$, because $(\delta_{ij})$ is a semimetric.

Fourth, since $f(T)$ is the average cost of the tours compatible with $T$ (Section 2.2), we have
$f(T) \le 2\,\mathrm{MST}$, and step 7 guarantees $f(T') \le f(T)$. Combining this with Lemma 2.1 and (10), we
conclude that Algorithm 1 returns a feasible solution $T'$ whose cost is at most $2\,\mathrm{MST}$, hence at
most $2\,\mathrm{OPT}$. It follows from Lemma 4.1 below that the whole algorithm, and in particular step 7,
can be implemented so that its running time is polynomial.
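Steps 1-6 of Algorithm 1 can be sketched in a few lines of Python. The code below is a minimal illustration, not the authors' implementation: it uses a simple quadratic-time version of Prim's algorithm for step 1, numbers species from 0, and represents the relabelled vertex $i'$ by the tuple `(i, "prime")`. Step 7 is not included here.

```python
def prim_mst(delta):
    """Edges of a minimum spanning tree of the complete graph with costs delta."""
    n = len(delta)
    in_tree = [False] * n
    best = [(float("inf"), None)] * n     # (cost, parent) of the cheapest link
    best[0] = (0, None)
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v][0])
        in_tree[u] = True
        if best[u][1] is not None:
            edges.append((best[u][1], u))
        for v in range(n):
            if not in_tree[v] and delta[u][v] < best[v][0]:
                best[v] = (delta[u][v], u)
    return edges

def leafify(n, mst_edges):
    """Steps 2-6: turn every non-leaf species i into an internal vertex
    (i, "prime") and hang species i off it as a new leaf (zero-cost edge)."""
    adj = {i: [] for i in range(n)}
    for u, v in mst_edges:
        adj[u].append(v)
        adj[v].append(u)
    for i in range(n):
        if len(adj[i]) >= 2:
            internal = (i, "prime")
            adj[internal] = adj[i] + [i]
            for v in adj[i]:
                adj[v] = [internal if x == i else x for x in adj[v]]
            adj[i] = [internal]
    return adj

# Example: delta[0][i] = 1 for i > 0 and delta[i][j] = 2 otherwise.
n = 5
delta = [[0 if i == j else (1 if 0 in (i, j) else 2) for j in range(n)]
         for i in range(n)]
edges = prim_mst(delta)
tree = leafify(n, edges)
```

On this example the minimum spanning tree is the star centered at species 0, so after `leafify` the only internal vertex is `(0, "prime")`, of degree 5, and all five species are leaves.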

Lemma 4.1. Let $T$ be any tree with $n$ leaves, namely, the $n$ species. Then one can find in
polynomial time a feasible cubic tree $T'$ with $f(T') \le f(T)$.

Proof. Pick an internal vertex $u$ with degree $q > 3$. Next, pick two neighbors $v_1$ and $v_2$ of $u$. Let
$T^{v_1 v_2}$ denote the tree obtained from $T$ by adding a new internal vertex $u'$ with neighborhood
$\{u, v_1, v_2\}$ and deleting $v_1$ and $v_2$ from the neighborhood of $u$. We claim that the $\pi$-matrix of
$T$, as defined by (4), can be obtained as a convex combination of the $\pi$-matrices of the trees
$T^{v_1 v_2}$, where $v_1, v_2 \in N_T(u)$. In particular, there exists a pair $v_1, v_2$ such that $f(T^{v_1 v_2}) \le f(T)$.
The lemma follows from the claim by repeating this step until no vertex of degree greater than 3
remains.

In order to prove the claim, denote by $(\pi_{ij})$ the $\pi$-matrix of $T$ and by $(\pi^{v_1 v_2}_{ij})$ the $\pi$-matrix
of $T^{v_1 v_2}$. Consider a pair $i, j$ of leaves of $T$.

If $u \notin P_{ij}(T)$, then $\pi^{v_1 v_2}_{ij} = \pi_{ij}$ always.

Otherwise, $u \in P_{ij}(T)$. Let $n_u$, $n_{u'}$, $n_{uu'}$ denote the number of pairs $v_1, v_2$ such that $P_{ij}(T^{v_1 v_2})$
contains, respectively, $u$ and not $u'$, $u'$ and not $u$, both $u$ and $u'$. Then $n_{u'} = 1$, $n_{uu'} = 2(q-2)$
and $n_u = \binom{q}{2} - n_{u'} - n_{uu'} = \frac{1}{2}q^2 - \frac{5}{2}q + 3$.

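Therefore,

$$\begin{aligned}
\frac{1}{\binom{q}{2}} \sum_{\{v_1, v_2\} \subseteq N(u)} \pi^{v_1 v_2}_{ij}
&= \frac{1}{\binom{q}{2}} \left( \frac{q-1}{q-2}\, n_u + \frac{q-1}{2}\, n_{u'} + \frac{q-1}{2(q-2)}\, n_{uu'} \right) \pi_{ij} \\
&= \frac{2}{q(q-1)} \left( \frac{q-1}{q-2} \left( \tfrac{1}{2}q^2 - \tfrac{5}{2}q + 3 \right) + \frac{q-1}{2} + \frac{q-1}{2(q-2)} \cdot 2(q-2) \right) \pi_{ij} \\
&= \frac{2}{q} \left( \frac{1}{q-2} \left( \tfrac{1}{2}q^2 - \tfrac{5}{2}q + 3 \right) + \frac{1}{2} + 1 \right) \pi_{ij} \\
&= \frac{2}{q} \cdot \frac{1}{q-2} \left( \tfrac{1}{2}q^2 - \tfrac{5}{2}q + 3 + \frac{q-2}{2} + (q-2) \right) \pi_{ij} \\
&= \frac{2}{q} \cdot \frac{1}{q-2} \left( \tfrac{1}{2}q^2 - q \right) \pi_{ij} \\
&= \pi_{ij}.
\end{aligned}$$

From what precedes, we infer that

$\pi_{ij} = \frac{1}{\binom{q}{2}} \sum_{\{v_1, v_2\} \subseteq N(u)} \pi^{v_1 v_2}_{ij}$

for all pairs of leaves $i, j$. The claim, and the result, follow. □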
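The proof suggests a simple procedure for step 7 of Algorithm 1: while some vertex $u$ has degree greater than 3, try every pair of neighbors of $u$ and keep the split of smallest cost. The Python sketch below follows that idea under a few assumptions made for illustration: the tree is an adjacency dictionary whose leaves are exactly the species 0..n-1, it has no vertices of degree 2, and new vertices are named `("split", t)`. It is not tuned for efficiency.

```python
from fractions import Fraction
from itertools import combinations

def cost(adj, delta):
    """f(T) = sum_{i<j} delta[i][j] * pi_ij, with pi_ij as in (4)."""
    n = len(delta)
    total = Fraction(0)
    for i in range(n):
        # Walk away from leaf i, carrying 2 * prod 1/(deg - 1) over the
        # internal vertices visited so far.
        stack = [(adj[i][0], i, Fraction(2))]
        while stack:
            u, parent, weight = stack.pop()
            if len(adj[u]) == 1:          # reached another leaf (a species)
                if u > i:
                    total += delta[i][u] * weight
                continue
            weight /= len(adj[u]) - 1
            stack.extend((v, u, weight) for v in adj[u] if v != parent)
    return total

def split(adj, u, v1, v2, new):
    """Return the tree in which a new vertex adjacent to u, v1 and v2
    replaces the edges u-v1 and u-v2."""
    out = {x: list(nb) for x, nb in adj.items()}
    out[u] = [x for x in out[u] if x not in (v1, v2)] + [new]
    out[new] = [u, v1, v2]
    for v in (v1, v2):
        out[v] = [new if x == u else x for x in out[v]]
    return out

def make_cubic(adj, delta):
    """Repeatedly apply the cheapest split at a vertex of degree > 3."""
    counter = 0
    while True:
        big = [u for u in adj if len(adj[u]) > 3]
        if not big:
            return adj
        u = big[0]
        candidates = [split(adj, u, v1, v2, ("split", counter))
                      for v1, v2 in combinations(adj[u], 2)]
        adj = min(candidates, key=lambda t: cost(t, delta))
        counter += 1

# Example: start from a star on 5 species.
star = {i: ["c"] for i in range(5)}
star["c"] = list(range(5))
delta = [[0, 3, 1, 4, 1],
         [3, 0, 5, 9, 2],
         [1, 5, 0, 6, 5],
         [4, 9, 6, 0, 3],
         [1, 2, 5, 3, 0]]
cubic = make_cubic(star, delta)
```

Because the cost of $T$ is the average of the costs of all candidate splits, the cheapest candidate never costs more than $T$, so the cost is non-increasing along the loop. Each split lowers the degree of $u$ by one and only creates a vertex of degree 3, so the loop terminates.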
Our final result follows. 

Theorem 4.2. Algorithm 1 is a 2-approximation algorithm for the Balanced Minimum Evolution
Problem.

4.3. Tightness of the Lower Bounds. Consider the metric instances of the Balanced Minimum
Evolution Problem with $\delta_{1i} = 1$ for all $i > 1$ and $\delta_{ij} = 2$ for all pairs of distinct species
such that $i > 1$ and $j > 1$. For these instances, $\mathrm{MST} = n - 1$. However, as can be easily checked, we also
have $\mathrm{OPT} = 2n - 2$: by Lemma 2.2 and (3), every feasible solution $T$ satisfies
$f(T) = 2n - \sum_{i > 1} 2^{2-d_{1i}} = 2n - 2$. Hence, $\lim_{n \to \infty} \mathrm{OPT}/\mathrm{MST} = 2$ for this family of instances. This indicates
that analyzing the approximation factor of an algorithm in terms of MST cannot yield a factor
smaller than 2.

We believe that this is even true when the stronger bound TSP is used, and make the 
following conjecture (backed by experimental evidence). 



Conjecture 4.3. The family of instances of the Balanced Minimum Evolution Problem in which
$(\delta_{ij})$ is the shortest path metric of an $n$-vertex cycle satisfies $\lim_{n \to \infty} \mathrm{OPT}/\mathrm{TSP} = 2$.